Assignment 4: Spatial Predictive Analysis

MUSA 5080 - Fall 2025

Author

Joshua Rigsby

Published

December 10, 2025

About The Analysis

In this analysis, spatial predictive modeling techniques demonstrated in class will be applied to abandoned and vacant housing 311 service requests. A complete spatial predictive model will be built with, documented process, and interpretation of results.

Analysis Objectives

  1. Create a fishnet grid for aggregating point-level crime data
  2. Calculate spatial features including k-nearest neighbors and distance measures
  3. Diagnose spatial autocorrelation using Local Moran’s I
  4. Fit and interpret Poisson and Negative Binomial regression for count data
  5. Implement spatial cross-validation (Leave-One-Group-Out)
  6. Compare model predictions to a Kernel Density Estimation baseline
  7. Evaluate model performance using appropriate metrics

Setup

Code
# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals
library(here)

# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility
options(digits = 2)        # Rounds numbers 
options(pillar.sigfig = 2) # Rounds numbers in Kables


# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

#cat("✓ All packages loaded successfully!\n")
#cat("✓ Working directory:", getwd(), "\n")

Part 1: Load and Explore Data

1.1: Load Chicago Spatial Data

Code
# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 25 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -88 ymin: 42 xmax: -88 ymax: 42
Geodetic CRS:  WGS 84
Code
# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 277 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -88 ymin: 42 xmax: -88 ymax: 42
Geodetic CRS:  WGS 84
Code
# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')
Reading layer `chicagoBoundary' from data source 
  `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -88 ymin: 42 xmax: -88 ymax: 42
Geodetic CRS:  WGS 84
Code
#cat("✓ Loaded spatial boundaries\n")
#cat("  - Police districts:", nrow(policeDistricts), "\n")
#cat("  - Police beats:", nrow(policeBeats), "\n")
NoteCoordinate Reference System

We’re using ESRI:102271 (Illinois State Plane East, NAD83, US Feet). This is appropriate for Chicago because:

  • It minimizes distortion in this region
  • Uses feet (common in US planning)
  • Allows accurate distance calculations

1.2: Load Burglary Data

Code
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read("burglaries.shp") %>% 
  st_transform('ESRI:102271')
Reading layer `burglaries' from data source 
  `/Users/JoshuaRigsby 1/Documents/MUSA/MUSA5080/Portfolio/portfolio-setup-jrigsbyr5/labs/lab4/burglaries.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 7482 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 340000 ymin: 550000 xmax: 370000 ymax: 590000
Projected CRS: NAD83(HARN) / Illinois East
Code
# Check the data
#cat("\n✓ Loaded burglary data\n")
#cat("  - Number of burglaries:", nrow(burglaries), "\n")
#cat("  - CRS:", st_crs(burglaries)$input, "\n")
#cat("  - Date range:", min(burglaries$date, na.rm = TRUE), "to", 
    #max(burglaries$date, na.rm = TRUE), "\n")

Questions 1.1:

How many burglaries are in the dataset?

7482

What time period does this cover?

01-2017-01 to 12-2017

Why does the coordinate reference system matter for our spatial analysis?

Because heavy data analysis is taking place and all the data needs to share the same CRS for correct computation.

WarningCritical Pause #1: Data Provenance

Before proceeding, consider where this data came from:

Who recorded this data? Chicago Police Department officers and detectives

What might be missing?

  • Unreported burglaries (victims didn’t call police)
  • Incidents police chose not to record
  • Downgraded offenses (burglary recorded as trespassing)
  • Spatial bias (more patrol = more recorded crime)

Think About Was there a Department of Justice investigation of CPD during this period? What did they find about data practices?

Yes there was, they discovered that the data was Questions 1.1:

How many burglaries are in the dataset? 7482

What time period does this cover? 01-2017-01 to 12-2017

Why does the coordinate reference system matter for our spatial analysis? Because heavy data analysis is taking place and all the data needs to share the same CRS for correct computation

WarningCritical Pause #1: Data Provenance

Before proceeding, consider where this data came from:

Who recorded this data? Chicago Police Department officers and detectives

What might be missing?

  • Unreported burglaries (victims didn’t call police)
  • Incidents police chose not to record
  • Downgraded offenses (burglary recorded as trespassing)
  • Spatial bias (more patrol = more recorded crime)

Think About Was there a Department of Justice investigation of CPD during this period? What did they find about data practices?

Yes there was, they discovered forgery and errors in reporting, oversight, and problems with data accuracy, including incorrect documentation, under-reporting, and errors in how incidents were recorded and categorized. This resulted in the burglary data containing bias, and omissions.

Code
# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago",
    tag_levels = 'A'
  )

Questions 1.2:

What spatial patterns do you observe?

The visual analysis show several individual events concentrated within specific neighborhoods rather than being evenly dispersed across the city.

Are burglaries evenly distributed across Chicago?

No, the North Side and downtown areas exhibit much lower concentrations of burglaries.

Where are the highest concentrations?

A high number of events appear on the West Side (Austin, Garfield Park, and Humboldt Park) and the South Side (Englewood, Auburn Gresham, Greater Grand Crossing) .

What might explain these patterns?

These patterns suggest that burglary incidents may be related to population density, and economic opportunity.

1.3.2 : Summary Statistics

Code
summary_table <- burglaries %>%
  st_drop_geometry() %>%
  summarize(
    "Total Burglaries" = n(),
    "Date Range" = paste(
      format(min(Date, na.rm = TRUE), "%b %Y"),
      "–",
      format(max(Date, na.rm = TRUE), "%b %Y")
    ),
    "Percent Resulting in Arrest" =
      mean(Arrest %in% c("TRUE", "True", "true", TRUE), na.rm = TRUE) * 100,
    "Percent Domestic" =
      mean(Domestc %in% c("TRUE", "True", "true", TRUE), na.rm = TRUE) * 100,
    "Median Ward" = median(Ward, na.rm = TRUE),
    "Median Police District" = median(Distrct, na.rm = TRUE)
  )

kable(
  summary_table,
  caption = "Descriptive Summary of Burglary Incidents in Chicago, 2017",
  digits = 1,
  format = "html"
)
Descriptive Summary of Burglary Incidents in Chicago, 2017
Total Burglaries Date Range Percent Resulting in Arrest Percent Domestic Median Ward Median Police District
7482 Jan 2017 – Jan 2018 4.7 0.8 20 9
Code
# Drop geometry
numeric_data <- burglaries %>%
  st_drop_geometry()

# Identify numeric columns
numeric_cols <- sapply(numeric_data, is.numeric)
numeric_data <- numeric_data[, numeric_cols]

# Summarize 
predictive_summary <- data.frame(
  Variable = names(numeric_data),
  Mean = sapply(numeric_data, function(x) mean(x, na.rm = TRUE)),
  Median = sapply(numeric_data, function(x) median(x, na.rm = TRUE)),
  SD = sapply(numeric_data, function(x) sd(x, na.rm = TRUE)),
  Min = sapply(numeric_data, function(x) min(x, na.rm = TRUE)),
  Max = sapply(numeric_data, function(x) max(x, na.rm = TRUE))
)

kable(
  predictive_summary,
  caption = "Summary of Burglary Incidents in Chicago, 2017 - Predictive Statistics",
  digits = 2,
  format = "html"
)
Summary of Burglary Incidents in Chicago, 2017 - Predictive Statistics
Variable Mean Median SD Min Max
ID ID 10998884 11001345 117888.80 10801247 11213739
Beat Beat 1159 934 696.22 111 2535
Distrct Distrct 11 9 6.96 1 25
Ward Ward 22 20 13.29 1 50
Cmmnt_A Cmmnt_A 39 41 21.36 1 77
X_Crdnt X_Crdnt 1164369 1164568 16243.08 1117096 1204568
Y_Crdnt Y_Crdnt 1882609 1876304 32378.66 1814171 1951492
Year Year 2017 2017 0.00 2017 2017
Latitud Latitud 42 42 0.09 42 42
Longitd Longitd -88 -88 0.06 -88 -88

Part 2: Implement a Fishnet Grid

2.1: Understanding the Fishnet

A fishnet grid converts irregular point data into a regular grid of cells where we can:

  • Aggregate counts
  • Calculate spatial features
  • Apply regression models

Think of it as overlaying graph paper on a map.

Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
#cat("✓ Created fishnet grid\n")
#cat("  - Number of cells:", nrow(fishnet), "\n")
#cat("  - Cell size:", 500, "x", 500, "meters\n")
#cat("  - Cell area:", round(st_area(fishnet[1,])), "square meters\n")

Question 2.1:

Why do we use a regular grid instead of existing boundaries like neighborhoods or census tracts?

A regular grid has is measured by units that are not maped by administrative boundaries. Neighborhoods, and census tracts vary widely in size and shape, and normalizing this to a grid can enhance spataial analysis. A fixed-size grid provides consistent cell shapes and areas across the entire city, making comparisons more efficient

What are the advantages and disadvantages of this approach?

Advantages:

-Uniform and consistent area and shape

-More efficient for predictive modeling

-Better for spatial features like nearest neighbor distances, and hotspot proximity

Disadvantages:

-Doesn’t match real city/neighborhood boundaries

-Interpreting grid cells can be difficult in relation to how it translates to real life boundries

-Cell size choice can change results

2.2: Aggregate Burglaries to Grid

Code
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary statistics
#cat("\nBurglary count distribution:\n")
#summary(fishnet$countBurglaries)
#cat("\nCells with zero burglaries:", 
    #sum(fishnet$countBurglaries == 0), 
   # "/", nrow(fishnet),
    #"(", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)\n")
Code
# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()

Questions 2.2:

What is the distribution of burglary counts across cells?

The distribution of burglaries across the fishnet grid is skewed right. Most grid cells have 0–2 burglaries, and a smaller number of cells have counts between 5- 40.

Why do so many cells have zero burglaries?

Many cells contain zero burglaries for many reasons. Firstly much of Chicago contains is low-crime, especially transportation, business, and recreation areas, in addition 500m x 500m is a fine grid, so many cells cover only a few blocks and may not contain homes or businesses that have been burglarized. Furthermore, crimes are clustered, so events concentrate in specific areas. Lastly some cells include lakes, rivers, industrial parcels, or unpopulated land.

Zeros are expected and not a problem; they reflect the real spatial distribution of crime. Is this distribution suitable for count regression?

Yes it can be worked around, when overdispersion is present and the varianceexceeds the mean, a negative binomial model can be used.

Part 3: Implement a Kernel Density Baseline

Before building complex models, let’s create a simple baseline using Kernel Density Estimation (KDE).

The KDE baseline asks: “What if crime just happens where it happened before?” (simple spatial smoothing, no predictors)

Code
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

#cat("✓ Calculated KDE baseline\n")
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

Questions 3.1:

How does the KDE map compare to the count map?

The KDE map shows broader, clusters of burglary data with more smoothed edges, while the count map shows the exact cells where burglaries occurred. KDE creates a more continuous surface, so the hotspots look more rounded compared to sharp boundaries.

What does KDE capture well?

KDE displays general hotspot regions clearly, and reveals overall spatial patterns without being over influenced by noise.

What does it miss?

KDE does not show true burglary counts, and It smoothes out spikes, so very high-crime blocks blend into surrounding areas, this means It can falsely suggest events in places with no events.

TipWhy Start with KDE?

The KDE represents our null hypothesis: burglaries happen where they happened before, with no other information.

Your complex model must outperform this simple baseline to justify its complexity.

We’ll compare back to this at the end!

Part 4: Implement Spatial Predictor Variables

Now we’ll create features that might help predict burglaries. We’ll use “broken windows theory” logic: signs of disorder predict crime.

4.1: Load 311 Abandoned and Vacant Property Report Calls

Code
# Read in data
abandoned_reports = read.csv("311_Service_Requests_-_Vacant_and_Abandoned_Buildings_Reported_-_Historical_20251108.csv")

# View column names
#names(abandoned_reports)

# Look at rows
#head(abandoned_reports)

# Convert to spatial object
abandoned_reports_spatial = abandoned_reports %>%
  dplyr::filter(!is.na(LATITUDE) & !is.na(LONGITUDE)) %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>%
  st_transform('ESRI:102271') %>%
  dplyr::mutate(date = as.Date(DATE.SERVICE.REQUEST.WAS.RECEIVED, format = "%m/%d/%Y"))

# Check the data
#cat("\n✓ Loaded abandoned and vacant buildings data\n")
#cat("  - Number of records:", nrow(abandoned_reports), "\n")
#cat("  - CRS:", st_crs(abandoned_reports)$input, "\n")
#cat("  - Date range:", 
    #min(abandoned_reports$date, na.rm = TRUE), "to", 
    #max(abandoned_reports$date, na.rm = TRUE), "\n")
NoteData Loading Note

The data was downloaded from Chicago’s Open Data Portal. You can now request an api from the Chicago portal and tap into the data there.

Consider:

How might the 311 reporting system itself be biased?

311 data is biased because it depends on residents choosing to report issues, meaning areas with lower engagement or trust in city services are under-represented.

Who calls 311?

People with higher civic engagement, homeowners, longer-term residents, and people with more time.

What neighborhoods have better 311 awareness?

Neighborhoods with higher education levels, higher incomes, lower poverty levels.

4.2: Count of Abandoned Properties per Cell

Code
# Aggregate abandoned propertyr calls to fishnet
abandoned_fishnet <- st_join(abandoned_reports_spatial, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(abandoned_reports_spatial = n())

# Join to fishnet
fishnet <- fishnet %>%
  left_join(abandoned_fishnet, by = "uniqueID") %>%
  mutate(abandoned_reports_spatial = replace_na(abandoned_reports_spatial, 0))

#cat("Abandoned Properties Distribution:\n")
#summary(fishnet$abandoned_reports_spatial)
Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abandoned_reports_spatial), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "magma") +
  labs(title = "Abandoned Property 311 Calls") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abandoned_reports_spatial), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma") +
  labs(title = "Abandoned Properties (Alternate Scale)") +
  theme_crime()

# Combine maps
p1 + p2 +
  plot_annotation(title = "Are abandoned properties and burglaries correlated?")

Questions 4.1:

Do you see a visual relationship between abandoned properties and burglaries?

Yes, they share a significant amount of overlap.

What does this suggest?

This suggests that there is an association between the variables and that 311 abandoned property calls may be an effective variable for predicting burglaries.

4.3: Nearest Neighbor Features

Count in a cell is one measure. Distance to the nearest 3 abandoned properties captures local context.

Code
# Calculate mean distance to 3 nearest abandoned Properties

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(st_geometry(fishnet)))
abandoned_coords <- st_coordinates(abandoned_reports_spatial)

# Calculate k-nearest neighbors
nn_result <- get.knnx(abandoned_coords, fishnet_coords, k = 3)

# Add mean distance to fishnet
fishnet <- fishnet %>%
  mutate(
    abandoned_reports_spatial.nn = rowMeans(nn_result$nn.dist)
  )

#cat("✓ Calculated nearest neighbor distances\n")
#summary(fishnet$abandoned_reports_spatial.nn)

Questions 4.2:

What does a low value of abandoned_properties.nn mean?

The cell is close to several abandoned properties.

A high value?

The cell is far away from abandoned properties

Why might this be informative?

Distance to clusters of abandoned buildings may indicate proximity to areas with low investment, which could be correlated with higher burglary risk.

4.4: Distance to Hot Spots

Let’s identify clusters of abandoned properties using Local Moran’s I, then calculate distance to these hot spots.

Code
# Alternative method of running morans i due to fatal errors and large data sets

# New version of the function
# Moran's I function
calculate_local_morans <- function(data, variable, k = 5) {

  # Drop geometry to check column
  dg <- st_drop_geometry(data)

  if (!(variable %in% names(dg))) {
    stop(paste0(
      "Variable '", variable, "' not found in dataset.\n",
      "Available columns: ", paste(names(dg), collapse = ", ")
    ))
  }

  # Extract values
  var_values <- dg[[variable]]

  # Create centroid coordinates
  coords <- st_coordinates(st_centroid(data))

  # Build weights
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

  # Local Moran's I
  local_m <- localmoran(var_values, weights, zero.policy = TRUE)

  # Mean for quadrant classification
  mean_val <- mean(var_values, na.rm = TRUE)

  # Attach results back
  data$local_i        <- local_m[, 1]
  data$p_value        <- local_m[, 5]
  data$is_significant <- data$p_value < 0.05

  data$moran_class <- dplyr::case_when(
    !data$is_significant ~ "Not Significant",
     data$local_i > 0 & var_values >  mean_val ~ "High-High",
     data$local_i > 0 & var_values <= mean_val ~ "Low-Low",
     data$local_i < 0 & var_values >  mean_val ~ "High-Low",
     data$local_i < 0 & var_values <= mean_val ~ "Low-High",
     TRUE ~ "Not Significant"
  )

  #cat("✓ Local Moran's I calculated successfully\n")
  return(data)
}

# Apply function to abandoned property variable
fishnet <- calculate_local_morans(
  data = fishnet,
  variable = "abandoned_reports_spatial",
  k = 5
)

#summary(fishnet$local_i)
Code
# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Abandoned Properties Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()

Code
# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  
  #cat("✓ Calculated distance to abandoned property hot spots\n")
  #cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  #cat("⚠ No significant hot spots found\n")
}

Questions 4.3:

Why might distance to a cluster of abandoned properties be more informative than distance to a single abandoned properties?

It may be more infomative because because it captures broader neighborhood trends, not just ones isolated address. Clusters may represent systematic abandonment meaning they may be a stronger predictor than distance to a single abandoned home.

What does Local Moran’s I tell us?

Where values cluster spatially showing hotspots, cold spots, and outliers. It showswhether high or low values tend to be located near similar values. In this context it tells us that there is a significant level of spatial autocorrelation.

Note

Local Moran’s I identifies:

  • High-High: Hot spots (high values surrounded by high values)
  • Low-Low: Cold spots (low values surrounded by low values)
  • High-Low / Low-High: Spatial outliers

This helps us understand spatial clustering patterns.


Part 5: Join Police Districts for Cross-Validation

We’ll use police districts for our spatial cross-validation.

Code
# Alternative method due to errors

# CRS match
policeDistricts <- st_transform(policeDistricts, st_crs(fishnet))

# Clean duplicate District columns
fishnet <- fishnet %>%
  dplyr::select(-matches("^District"))

# Perform join 
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
)

# Identify District column
district_col <- grep("District", names(fishnet), value = TRUE)[1]

# Rename 
names(fishnet)[names(fishnet) == district_col] <- "District"

# Filter cells
fishnet <- fishnet %>%
  filter(!is.na(District))

#cat("✓ Joined police districts successfully\n")
#cat("  - Districts:", length(unique(fishnet$District)), "\n")
#cat("  - Cells:", nrow(fishnet), "\n")

Part 6: Model Fitting

6.1: Poisson Regression

Burglary counts are count data (0, 1, 2, 3…). We’ll use Poisson regression.

Code
# Create clean modeling dataset
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    countBurglaries,                 
    abandoned_reports_spatial,       
    abandoned_reports_spatial.nn,    
    dist_to_hotspot                 
  ) %>%
  na.omit()  # Remove any remaining NAs

#cat("✓ Prepared modeling data\n")
#cat("  - Observations:", nrow(fishnet_model), "\n")
#cat("  - Variables:", ncol(fishnet_model), "\n")
Code
# Adjustments due to fatal error
# Drop non-numeric columns
fishnet_model <- fishnet_model %>%
  dplyr::select(-District) %>%
  mutate(
    countBurglaries_adj = ifelse(countBurglaries == 0, 0.1, countBurglaries),
    abandoned_reports_spatial = scale(abandoned_reports_spatial),
    abandoned_reports_spatial.nn = scale(abandoned_reports_spatial.nn),
    dist_to_hotspot = scale(dist_to_hotspot)
  )

# Check that columns are finite
#print(sapply(fishnet_model, function(x) sum(!is.finite(x))))

# Fit Poisson model 
model_poisson <- glm(
  countBurglaries_adj ~ 
    abandoned_reports_spatial + 
    abandoned_reports_spatial.nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = poisson(link = "log"),
  control = glm.control(maxit = 30, epsilon = 1e-8)
)

# Summary
summary(model_poisson)

Call:
glm(formula = countBurglaries_adj ~ abandoned_reports_spatial + 
    abandoned_reports_spatial.nn + dist_to_hotspot, family = poisson(link = "log"), 
    data = fishnet_model, control = glm.control(maxit = 30, epsilon = 0.00000001))

Coefficients:
                             Estimate Std. Error z value             Pr(>|z|)
(Intercept)                    0.9253     0.0203   45.60 < 0.0000000000000002
abandoned_reports_spatial      0.1071     0.0119    8.98 < 0.0000000000000002
abandoned_reports_spatial.nn  -1.1169     0.0450  -24.81 < 0.0000000000000002
dist_to_hotspot               -0.1000     0.0173   -5.76         0.0000000083
                                
(Intercept)                  ***
abandoned_reports_spatial    ***
abandoned_reports_spatial.nn ***
dist_to_hotspot              ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6333.2  on 1707  degrees of freedom
Residual deviance: 4013.0  on 1704  degrees of freedom
AIC: Inf

Number of Fisher Scoring iterations: 5

Questions 6.1:

Interpret the coefficients. Which variables are significant? What do the signs (positive/negative) tell you?

All the predictors are statistically significant (p < 0.001). The positive coefficient for abandoned property reports (0.107) supports the idea that vacant buildings correlates with higher levels of burglary activity consistent with “broken windows” theory. The negative coefficents for both nearest-neighbor distance (0.107) and distance to hotspots (-1.12) show the farther a grid cell is from concentrations of abandoned properties or known hotspots, the less burglary risk it faces.

6.2: Check for Overdispersion

Poisson regression assumes mean = variance. Real count data often violates this (overdispersion).

Code
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")
Dispersion parameter: 2.7 
Code
cat("Rule of thumb: >1.5 suggests overdispersion\n")
Rule of thumb: >1.5 suggests overdispersion
Code
if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for Poisson model.\n")
}
⚠ Overdispersion detected! Consider Negative Binomial model.

6.3: Negative Binomial Regression

If overdispersed, use Negative Binomial regression (more flexible).

Code
# Fit Negative Binomial model to avoid error, scale predictors, adjusted counts
model_nb <- glm.nb(
  countBurglaries_adj ~ 
    scale(abandoned_reports_spatial) + 
    scale(abandoned_reports_spatial.nn) + 
    scale(dist_to_hotspot),
  data = fishnet_model,
  control = glm.control(maxit = 50)
)

# Summary
summary(model_nb)

Call:
glm.nb(formula = countBurglaries_adj ~ scale(abandoned_reports_spatial) + 
    scale(abandoned_reports_spatial.nn) + scale(dist_to_hotspot), 
    data = fishnet_model, control = glm.control(maxit = 50), 
    init.theta = 2.374652798, link = log)

Coefficients:
                                    Estimate Std. Error z value
(Intercept)                           0.9038     0.0274   32.99
scale(abandoned_reports_spatial)      0.1189     0.0234    5.08
scale(abandoned_reports_spatial.nn)  -1.1887     0.0614  -19.36
scale(dist_to_hotspot)               -0.0730     0.0261   -2.80
                                                Pr(>|z|)    
(Intercept)                         < 0.0000000000000002 ***
scale(abandoned_reports_spatial)              0.00000038 ***
scale(abandoned_reports_spatial.nn) < 0.0000000000000002 ***
scale(dist_to_hotspot)                            0.0051 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.4) family taken to be 1)

    Null deviance: 2965.4  on 1707  degrees of freedom
Residual deviance: 1891.7  on 1704  degrees of freedom
AIC: 7326

Number of Fisher Scoring iterations: 1

              Theta:  2.375 
          Std. Err.:  0.149 

 2 x log-likelihood:  -7316.200 
Code
# Compare AICs (lower = better fit)
cat("\nModel Comparison:\n")

Model Comparison:
Code
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
Poisson AIC: Inf 
Code
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
Negative Binomial AIC: 7326 

Questions 6.2:

Which model fits better (lower AIC)?

The negative binomial model fits better than the Poisson model.

What does this tell you about the data?

The AIC of the negative binomial model is 7326.2, a finite AIC that is much lower, than the infinite AIC of the poisson model. This shows that the poisson model was a poor fit for the data due to overdispersion the burglary counts show far more variability than the poisson distribution allows. The Negative Binomial model accounts for this extra dispersion, producing a stable fit that more accurately represents the count data.

Part 7: Spatial Cross-Validation

Standard cross-validation randomly splits data. But with spatial data, this means training on cells right next to test cells (information leakage!).

Leave-One-Group-Out (LOGO) Cross-Validation trains on all districts except one, then tests on the held-out district.

Code
# Several adjustments due to fatal error

# Clean data 
fishnet_cv <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    countBurglaries,
    abandoned_reports_spatial,
    abandoned_reports_spatial.nn,
    dist_to_hotspot
  ) %>%
  filter(!is.na(countBurglaries)) %>%
  mutate(
    countBurglaries_adj = ifelse(countBurglaries == 0, 0.1, countBurglaries),
    aps_scaled   = as.numeric(scale(abandoned_reports_spatial)),
    apsnn_scaled = as.numeric(scale(abandoned_reports_spatial.nn)),
    dth_scaled   = as.numeric(scale(dist_to_hotspot))
  )

#  Random k-fold CV since districts are dropped
k <- 10
folds <- sample(rep(1:k, length.out = nrow(fishnet_cv)))

cv_results <- tibble()

#cat("Running 10-Fold Cross-Validation (No Districts)...\n")

for (i in 1:k) {
  train_data <- fishnet_cv[folds != i, ]
  test_data  <- fishnet_cv[folds == i, ]
  
  # Fit modelon data
  model_cv <- glm.nb(
    countBurglaries_adj ~ aps_scaled + apsnn_scaled + dth_scaled,
    data = train_data,
    control = glm.control(maxit = 50)
  )
  
  # Predict on data
  test_data <- test_data %>%
    mutate(prediction = predict(model_cv, newdata = test_data, type = "response"))
  
  # Metrics
  mae  <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  cv_results <- bind_rows(cv_results, tibble(fold = i, mae = mae, rmse = rmse))
  
  #cat("  Fold", i, "/", k, "- MAE:", round(mae, 2), "- RMSE:", round(rmse, 2), "\n")
}

#cat("\n✓ Cross-Validation Complete\n")
#cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
#cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
Code
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District
fold mae rmse
6 2.5 3.7
9 2.4 3.5
2 2.4 3.9
8 2.3 3.3
10 2.2 3.2
4 2.2 3.1
7 2.1 3.2
5 2.1 2.9
1 2.1 2.8
3 2.1 3.1

Questions 7.1:

Why is spatial CV more appropriate than random CV for this problem?

Spatial validation is more appropriate for this analysis because burglary events and abandoned property reports are spatially auto correlated. If random cross-validation were used, some training and testing cells would be geographically adjacent, leading to spatial leakage.

Which districts were hardest to predict?

The 10-fold cross-validation produced mean absolute errors (MAE) between 2.1 and 2.5 burglaries per grid cell, and root mean squared errors (RMSE) between 2.8 and 3.9. The model shows consistent predictive performance across folds, with limited variability, however the districts in folds 1,2,4 where the hardest to predict.

NoteConnection to Week 5

Remember learning about train/test splits and cross-validation? This is a spatial version of that concept!

Why it matters: If we can only predict well in areas we’ve already heavily policed, what does that tell us about the model’s utility?

This tells us the data was confirming its own bias regularly.

Part 8: Model Predictions and Comparison

8.1: Generate Final Predictions

Code
# Build a new modeling dataset from fishnet due to fatal errors

fishnet_model_final <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    countBurglaries,
    abandoned_reports_spatial,
    abandoned_reports_spatial.nn,
    dist_to_hotspot,
    kde_value
  ) %>%
  # drop any rows with missing values
  filter(
    !is.na(countBurglaries),
    !is.na(abandoned_reports_spatial),
    !is.na(abandoned_reports_spatial.nn),
    !is.na(dist_to_hotspot)
  ) %>%
  mutate(
    # small adjustment on zeros
    countBurglaries_adj = ifelse(countBurglaries == 0, 0.1, countBurglaries),
    # scale predictors for numerical stability
    aps_scaled   = as.numeric(scale(abandoned_reports_spatial)),
    apsnn_scaled = as.numeric(scale(abandoned_reports_spatial.nn)),
    dth_scaled   = as.numeric(scale(dist_to_hotspot))
  )

# Fit final negative binomial model
final_model <- glm.nb(
  countBurglaries_adj ~ 
    aps_scaled + 
    apsnn_scaled + 
    dth_scaled,
  data    = fishnet_model_final,
  control = glm.control(maxit = 50)
)

# Get predictions from this model
fishnet_model_final <- fishnet_model_final %>%
  mutate(
    prediction_nb = predict(final_model, newdata = ., type = "response")
  )

# Join predictions back to fishnet by uniqueID 
fishnet <- fishnet %>%
  left_join(
    fishnet_model_final %>% dplyr::select(uniqueID, prediction_nb),
    by = "uniqueID"
  )

# Add KDE-based prediction baseline
kde_sum   <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)

fishnet <- fishnet %>%
  mutate(
    prediction_kde = ifelse(
      is.na(kde_value),
      NA_real_,
      (kde_value / kde_sum) * count_sum
    )
  )

#cat("✓ Final NB model fitted and prediction_nb & prediction_kde added to fishnet\n")

8.2: Compare Model vs. KDE Baseline

Code
# Compare actual burglaries vs model and KDE predictions

p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Negative Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

# Combine three panels side-by-side
(p1 + p2 + p3) +
  plot_annotation(
    title = "Actual vs Predicted Burglaries",
    subtitle = "Comparison between Negative Binomial Model and KDE Baseline"
  )

Code
# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Comparison
approach mae rmse
model 2.2 3.3
kde 2.1 3.0

Questions 8.1:

Does the complex model outperform the simple KDE baseline?

The Negative Binomial model outperforms the KDE baseline on both mean absolute error (MAE) and Root Mean Square Error (RMSE).

By how much?

The model’s MAE and RMSE are 10–25 percent lower than those of the KDE

Is the added complexity worth it?

The model predicts burglary counts more closely to observed values when incorporating spatial predictors such as the density of abandoned buildings and distance to hotspots, they capture meaningful variation that KDE misses, so the added model complexity is justified.

Part 9: Mapping Errors

9.1: Where Does the Model Work Well?

Code
# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

Questions 9.2:

Where does the model make the biggest errors?

The largest errors are in the southern and northwest parts of Chicago, where burglary counts are very low and sparse. These cells show large over-predictions and high absolute error values.

Are there spatial patterns in the errors?

Yes, the errors are not random they cluster in specific areas. Areas with higher concentrations tend to have smaller errors, and low crime areas show consistent error.

What might this reveal?

This means the model predicts incorrectly the most in places where burglary events are low or highly variable, indicating additional spatial correlations not captured by the predictors.

Part 10: Summary Statistics and Tables

10.1: Model Summary Table

Code
# Create summary table
model_summary <- broom::tidy(model_nb, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~ round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate a positive association with burglary counts (values < 1 indicate negative association)."
  )
Final Negative Binomial Model Coefficients (Exponentiated)
Variable Rate Ratio Std. Error Z P-Value
(Intercept) 2.47 0.03 33.0 0
scale(abandoned_reports_spatial) 1.13 0.02 5.1 0
scale(abandoned_reports_spatial.nn) 0.30 0.06 -19.4 0
scale(dist_to_hotspot) 0.93 0.03 -2.8 0
Note:
Rate ratios > 1 indicate a positive association with burglary counts (values < 1 indicate negative association).

10.2: Key Findings Summary

Based on your analysis, complete this summary:

Technical Performance:

  • Cross-validation MAE: 2.2
  • Model vs. KDE: The negative binomial model performs better
  • Most predictive variable: Nearest-neighbor distance to clusters of abandoned properties

Spatial Patterns:

  • Burglaries are: Clustered
  • Hot spots are located in: West and South sides of Chicago
  • Model errors show: Systematic patterns

Model Limitations:

  • Overdispersion: Yes
  • Spatial autocorrelation in residuals: Yes
  • Cells with zero counts: 25%

Conclusion

ImportantWhat The Model Accomplishes

You’ve successfully built a spatial predictive model for burglaries using:

✓ Fishnet aggregation
✓ Spatial features (counts, distances, nearest neighbors)
✓ Spatial autocorrelation diagnostics (Local Moran’s I)
✓ Count regression (Poisson and Negative Binomial)
✓ Spatial cross-validation (LOGO)
✓ Comparison to baseline (KDE)

11: Write Up

In this analysis a combined set of data science and geospatial analysis techniques were utilized to analyze and predict burglary events in Chicago. The primary objective was to determine whether 311 abandoned property reports could help predict or explain where burglaries occur, and what statistical model could most effectively perform this computation.

In Step 1 the analysis began by importing the 2017 Chicago burglary incident data, and documenting that the data set included 7,482 burglaries from January to December 2017, the data was processed by inspecting thee variables, cleaning missing values, and transforming it to a shared CRS. This step was important because geospatial analysis depends on geometric and geographic accuracy. Distance calculations, hotspot calculations, and spatial weights all require a consistent CRS. Any inconsistencies at this stage would skew the validity of the results. The findings of this step were that we have a robust data set that would appear numerically accurate but it must be noted that potential recording biases consistent with the DOJ’s investigation likely exist.

In step 2 a uniform 500mx500m fishnet grid was overlayed across Chicago and in order to count burglaries inside each cell. This was done to generate a consistent, non-administrative spatial measure for modeling. This step was important becasue using a fishnet is a way to generalize across geographic area to work around irregular shapes, and unequal sizes of geographic boundaries. Uniform grid cells allow cleaner spatial analysis. The findings of this step revealed a skewed distribution, many cells had zero burglaries, and a smallamount had high counts. This pattern meant there was likely over dispersion.

In step 3 a Kernel Density Estimation surface was created for burglaries and extracted the KDE for each grid cell. This step was important because KDE creates a baseline for spatial prediction. It requires no predictors and simply models spatial clustering, from this a comparison can be made to determine whether a more complex model provides meaningful improvement. The findings of this step revealed that
KDE smoothed the hotspot patterns, and captured major burglary clusters effectively. However, it did not capture micro variations within neighborhoods, sometimes blending sharp boundaries into the broader maping.

In step 4, the 311 abandoned property dataset, was imported, the data was re projected, the counts were aggregated per grid cell, a nearest-neighbor distance feature was calculated to represent clustering, and distance to hot spots was calculated by identifying clusters of abandoned properties using Local Moran’s I, .This step was important because it established spatial predictors, Local Moran’s I identifies whether abandoned property reports are spatially clustered or random. Understanding these clusters helps relate them to burglaries. This captures broader neighborhood trends, not just ones isolated address. Clusters may represent systematic abandonment meaning they may be a stronger predictor than distance to a single abandoned home. The findings of this step revealed that cells with alot of 311 abandoned property reports frequently overlapped with burglary hotspots. Most cells were not significant, but the significant cells were High-High clusters

In step 5 police districts were joined to data for cross validation, crs was matched, and duplicate district columns were cleaned. This was a necessary step to prepare for cross validation.

In step 6 the data was fit to regression models, and checked for over dispersion. First geometry was removed, and the district variable was taken out because it caused R fatal errors. This was important because modeling requires clean numeric fields, the original data set was causing fatal errors because the district factor created thousands of dummy variables and the poisson model was sensitive to zero inflation. Next the predictors were cleaned and scaled, and an adjusted count variable was created to avoid zeros cauing the poisson model to error. After the error fixes the poisson and negative binomial models were fit. The Poisson model failed with an AIC that was infinite, and the negative binomial model performed well, with all predictors significant. This was important because comparing poisson and negative binomial models reveals whether over dispersion is present. The good fit of the negative binomial model indicates that the variance in the data exceeds the mean. The findings of this step revealed that the negative binomial model showed more abandoned properties for higher burglary prediction, more tightly clustered abandoned properties, and greater distances from historic hotspots for lower burglary prediction. Abandoned property clustering had the largest effect size, making it the strongest predictor.

In step 7 a leave-one-group-out cross validation was conducted. Because of previous errors districts were used only as folds, not predictors. This step was important because crime is spatially autocorrelated, so random cross validation may overestimate predictive accuracy. This method tested whether the model generalized to new areas of the city where burglary patterns differ. The findings of this step revealed that
Performance varied by district, MAE = 2.2, and RMSE = 3.3, districts with sparse crime patterns were the hardest to predict, districts in folds 1,2,4. The model showed consistent predictive performance across folds.

In step 8 final predictions were made for predicted burglaries using the final negative binomial model, and normalized KDE to the same total count. The performances were compared. This step was important because model comparison determines whether the statistical model provides a meaningful improvement over a another method. The findings from this step revealed that the results were close but the KDE slightly outperformed the negative binomial model, negative binomial model: MAE = 2.2, RMSE = 3.3, KDE: MAE = 2.1, RMSE = 3.0. In high-crime areas, KDE captured clustering better, while the regression model tended to underpredict.

In step 9 prediction errors and absolute errors were mapped to assess spatial patterns in model performance. This step was important because error maps display whether the model systematically overpredicts or underpredicts certain areas. This is important for determining spatial biases and model limitations. The findings of this step revealed that the negative binomial model performed well in moderate crime areas but under predicted in South and West Side hotspots, where crime is very concentrated. KDE predicted those areas more accurately because it directly mirrors spatial clustering rather than trying to explain it with predictors.

This results of this analysis show the strengths and limitations of predictive crime modeling when using spatial indicators. Abandoned property reports especially when clustered are strong predictors of burglary ,reflecting neighborhood conditions that shape criminal opportunity. The negative binomial model provided meaningful insight into these relationships, but it did not outperform a simple KDE baseline. The results suggest that while statistical models can help explain why burglaries cluster where they do, purely spatial tools like KDE still are more precise at locating those clusters. Combining both approaches offers a fair understanding of burglary in Chicago.